Problem Statement: Concrete Strength Prediction
Objective: To predict the concrete strength using the data available in file concrete_data.xls. Apply feature engineering and model tuning to obtain an R2 score of 80% to 95%.
Resources Available: The data for this project is available at https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/. The same has been shared along with the course content.
Steps and Tasks:
- Exploratory data quality report reflecting the following:
Bi-variate analysis between the predictor variables, and between the predictor variables and the target column. Comment on your findings in terms of their relationship and degree of relation, if any. Visualize the analysis using boxplots and pair plots, histograms or density curves. (10 marks)
Feature Engineering techniques (10 marks)
a. Identify opportunities (if any) to extract a new feature from existing features, or drop a feature (if required).
b. Get the data model-ready and do a train-test split.
c. Decide on the complexity of the model: should it be a simple linear model in terms of parameters, or would a quadratic or higher-degree model be needed?
- Creating the model and tuning it
Attribute Information:
Given below are the variable name, variable type, measurement unit and a brief description. Predicting the concrete compressive strength is the regression problem. The order of this listing corresponds to the order of the numerals along the rows of the database.
Name -- Data Type -- Measurement -- Description
- Cement (cement) -- quantitative -- kg in a m3 mixture -- Input Variable
- Blast Furnace Slag (slag) -- quantitative -- kg in a m3 mixture -- Input Variable
- Fly Ash (ash) -- quantitative -- kg in a m3 mixture -- Input Variable
- Water (water) -- quantitative -- kg in a m3 mixture -- Input Variable
- Superplasticizer (superplastic) -- quantitative -- kg in a m3 mixture -- Input Variable
- Coarse Aggregate (coarseagg) -- quantitative -- kg in a m3 mixture -- Input Variable
- Fine Aggregate (fineagg) -- quantitative -- kg in a m3 mixture -- Input Variable
- Age (age) -- quantitative -- Day (1~365) -- Input Variable
- Concrete compressive strength (strength) -- quantitative -- MPa -- Output Variable
#!pip install tensorflow
#!pip install catboost
#!pip install eli5
# available matplotlib - ['seaborn-deep','seaborn-muted','bmh','seaborn-white','dark_background','seaborn-notebook','seaborn-darkgrid','grayscale','seaborn-paper','seaborn-talk','seaborn-bright','classic','seaborn-colorblind','seaborn-ticks','ggplot','seaborn','_classic_test','fivethirtyeight','seaborn-dark-palette','seaborn-dark','seaborn-whitegrid','seaborn-pastel','seaborn-poster']
# Imports
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
from collections import OrderedDict
%matplotlib inline
# Checking Leverage and Influence Points
from statsmodels.graphics.regressionplots import *
import statsmodels.stats.stattools as stools
import statsmodels.formula.api as smf
import statsmodels.stats as stats
import scipy.stats as scipystats
import statsmodels.api as sm
# Checking multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
from patsy import dmatrices
# Cluster analysis
from sklearn.cluster import KMeans
# Feature importance
import eli5
from eli5.sklearn import PermutationImportance
# Modelling
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, ExtraTreesRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split, KFold, cross_val_score, learning_curve
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor, Pool
from sklearn.svm import SVR
# Metrics
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.utils import resample
# Display settings
pd.options.display.max_rows = 400
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
# Checking if a GPU is found
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
# Suppress warnings
import warnings; warnings.filterwarnings('ignore')
random_state = 2020
np.random.seed(random_state)
# Reading the data as dataframe and print the first five rows
concrete = pd.read_csv('concrete.csv')
concrete.head()
# Get info of the dataframe columns
concrete.dtypes
# Get no. of rows and columns
concrete.shape
# Describe concrete dataset
concrete.describe()
# Checking any null values
concrete.isnull().sum()
The dataset has 1030 rows and 9 columns, with no missing values.
All features are numerical. strength is the (continuous) target variable; age is a discrete feature, whereas the rest are continuous.
# Customized function to report the skewness and outliers of all numeric columns
def cdescribe(df):
    results = []
    for col in df.select_dtypes(include = ['float64', 'int64']).columns.tolist():
        stats = OrderedDict({'': col,
                             'Count': df[col].count(),
                             'Type': df[col].dtype,
                             'Mean': round(df[col].mean(), 2),
                             'StdDev': round(df[col].std(), 2),
                             'Variance': round(df[col].var(), 2),
                             'Minimum': round(df[col].min(), 2),
                             'Q1': round(df[col].quantile(0.25), 2),
                             'Median': round(df[col].median(), 2),
                             'Q3': round(df[col].quantile(0.75), 2),
                             'Maximum': round(df[col].max(), 2),
                             'Range': round(df[col].max() - df[col].min(), 2),
                             'IQR': round(df[col].quantile(0.75) - df[col].quantile(0.25), 2),
                             'Kurtosis': round(df[col].kurt(), 2),
                             'Skewness': round(df[col].skew(), 2),
                             'MeanAbsDev': round(df[col].mad(), 2)})
        # Severity of skew from the magnitude, direction from the sign
        skew = df[col].skew()
        if abs(skew) > 1:
            severity = 'Highly Skewed'
        elif abs(skew) > 0.5:
            severity = 'Moderately Skewed'
        else:
            severity = 'Fairly Symmetrical'
        ske = f"{severity} ({'Right' if skew > 0 else 'Left'})"
        stats['SkewnessComment'] = ske
        upper_lim = stats['Q3'] + (1.5 * stats['IQR'])
        lower_lim = stats['Q1'] - (1.5 * stats['IQR'])
        if len([x for x in df[col] if x < lower_lim or x > upper_lim]) > 0:
            out = 'HasOutliers'
        else:
            out = 'NoOutliers'
        stats['OutliersComment'] = out
        results.append(stats)
        if df[col].median() > df[col].mean():
            med_mean = 'more than'
        elif df[col].median() < df[col].mean():
            med_mean = 'less than'
        else:
            med_mean = 'the same as'
        # Print the descriptive statistics report for each column in the dataset
        print(f'\n{col} - Data ranges between {round(df[col].min(), 2)} and {round(df[col].max(), 2)}, while the 25th and 75th percentiles span {round(df[col].quantile(0.25), 2)} to {round(df[col].quantile(0.75), 2)}. Median {round(df[col].median(), 2)} is {med_mean} Mean {round(df[col].mean(), 2)}, which means {col} is {ske}. Column has {out}.')
    describe = pd.DataFrame(results).set_index('')
    return display(describe.T)
cdescribe(concrete)
print('Checking Outliers using boxplot'); print('--'*60)
fig = plt.figure(figsize = (15, 8))
ax = sns.boxplot(data = concrete.iloc[:, 0:-1], orient = 'h')
def bdplots(df, col):
    f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (15, 7))
    # Boxplot to check outliers
    sns.boxplot(x = col, data = df, ax = ax1, orient = 'v', color = 'lightblue')
    # Distribution plot with outliers
    sns.distplot(df[col], ax = ax2, color = 'teal', fit = norm, rug = True).set_title(f'{col} with outliers')
    ax2.axvline(df[col].mean(), color = 'r', linestyle = '--', label = 'Mean', linewidth = 1.2)
    ax2.axvline(df[col].median(), color = 'g', linestyle = '--', label = 'Median', linewidth = 1.2)
    ax2.axvline(df[col].mode()[0], color = 'b', linestyle = '--', label = 'Mode', linewidth = 1.2); ax2.legend(loc = 'best')
    # Clipping outliers to the 1st/99th percentiles, but in a new dataframe
    lowerbound, upperbound = np.percentile(df[col], [1, 99])
    y = pd.DataFrame(np.clip(df[col], lowerbound, upperbound))
    # Distribution plot without outliers
    sns.distplot(y[col], ax = ax3, color = 'tab:orange', fit = norm, rug = True).set_title(f'{col} without outliers')
    ax3.axvline(y[col].mean(), color = 'r', linestyle = '--', label = 'Mean', linewidth = 1.2)
    ax3.axvline(y[col].median(), color = 'g', linestyle = '--', label = 'Median', linewidth = 1.2)
    ax3.axvline(y[col].mode()[0], color = 'b', linestyle = '--', label = 'Mode', linewidth = 1.2); ax3.legend(loc = 'best')
    kwargs = {'fontsize': 14, 'color': 'blue'}
    ax1.set_title(col + ' Boxplot Analysis', **kwargs)
    ax1.set_xlabel('Box', **kwargs)
    ax1.set_ylabel(col + ' Values', **kwargs)
    return plt.show()
print('Boxplot, distribution of columns with and without outliers'); print('--'*60)
columns = list(concrete.columns)[:-1]
for i in columns:
    Q3 = concrete[i].quantile(0.75)
    Q1 = concrete[i].quantile(0.25)
    IQR = Q3 - Q1
    no_outlier = len(concrete.loc[(concrete[i] < (Q1 - 1.5 * IQR)) | (concrete[i] > (Q3 + 1.5 * IQR))])
    print(f'{i.capitalize()} column \nNumber of rows with outliers: {no_outlier}')
    # Print the outlier rows
    if no_outlier > 0:
        display(concrete.loc[(concrete[i] < (Q1 - 1.5 * IQR)) | (concrete[i] > (Q3 + 1.5 * IQR))].head(no_outlier))
    bdplots(concrete, i)
del i, Q1, Q3, IQR, columns, no_outlier
# Outliers removal
def replace_outliers(df, col, method = 'quantile', strategy = 'median'):
    if method == 'quantile':
        Q3, Q2, Q1 = df[col].quantile([0.75, 0.50, 0.25])
        IQR = Q3 - Q1
        upper_lim = Q3 + (1.5 * IQR)
        lower_lim = Q1 - (1.5 * IQR)
        print(f'Outliers for {col} are: {sorted([x for x in df[col] if x < lower_lim or x > upper_lim])}\n')
        if strategy == 'median':
            df.loc[(df[col] < lower_lim) | (df[col] > upper_lim), col] = Q2
        else:
            df.loc[(df[col] < lower_lim) | (df[col] > upper_lim), col] = df[col].mean()
    elif method == 'stddev':
        col_mean, col_std, Q2 = df[col].mean(), df[col].std(), df[col].median()
        cut_off = col_std * 3
        lower_lim, upper_lim = col_mean - cut_off, col_mean + cut_off
        print(f'Outliers for {col} are: {sorted([x for x in df[col] if x < lower_lim or x > upper_lim])}\n')
        if strategy == 'median':
            df.loc[(df[col] < lower_lim) | (df[col] > upper_lim), col] = Q2
        else:
            df.loc[(df[col] < lower_lim) | (df[col] > upper_lim), col] = col_mean
    else:
        print('Please pass a valid method and strategy')
# Replacing outliers with mean values using the quantile method
print('Replacing outliers with mean values using quantile method'); print('--'*62)
concrete1 = concrete.copy(deep = True)
outliers_cols = ['slag', 'water', 'superplastic', 'fineagg', 'age']
for col in outliers_cols:
    replace_outliers(concrete1, col, method = 'quantile', strategy = 'mean')
print('\nColumns for which outliers were replaced with the mean using the quantile method: \n', outliers_cols)
print('With Outliers'); print('--'*62); display(concrete[outliers_cols].describe().T)
print('\nWithout Outliers'); print('--'*62); display(concrete1[outliers_cols].describe().T)
A quick observation after replacing the outliers: medians remain unchanged while means change only slightly, not significantly. The type of skewness remains unchanged.
# Checking the outliers graphically after replacement
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = concrete1.iloc[:, 0:-1], orient = 'h')
sns.pairplot(concrete1, diag_kind = 'kde')
Cement and strength have a linear relationship.
Columns with bi/multimodal distributions are slag, ash and superplastic.
for col in list(concrete1.columns)[:-2]:
    fig, ax1 = plt.subplots(figsize = (9, 6.5), ncols = 1, sharex = False)
    sns.regplot(x = concrete1[col], y = concrete1['strength'], ax = ax1).\
        set_title(f'Understanding the relation between {col} and strength')
Reference for carrying out this analysis: https://songhuiming.github.io/pages/2016/12/31/linear-regression-in-python-chapter-2/
Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. These leverage points can have an effect on the estimate of regression coefficients.
Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.
A studentized residual is calculated by dividing the residual by an estimate of its standard deviation. The standard deviation for each residual is computed with the observation excluded. For this reason, studentized residuals are sometimes referred to as externally studentized residuals.
lm = smf.ols(formula = 'strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age', data = concrete1).fit()
print(lm.summary())
influence = lm.get_influence()
resid_student = influence.resid_studentized_external
(cooks, p) = influence.cooks_distance
(dffits, p) = influence.dffits
leverage = influence.hat_matrix_diag
print('\n')
print('Leverage v.s. Studentized Residuals')
fig = plt.figure(figsize = (15, 7.2))
sns.regplot(x = leverage, y = lm.resid_pearson, fit_reg = False)
concrete1_res = pd.concat([pd.Series(cooks, name = 'cooks'), pd.Series(dffits, name = 'dffits'), pd.Series(leverage, name = 'leverage'), pd.Series(resid_student, name = 'resid_student')], axis = 1)
concrete1_res = pd.concat([concrete1, concrete1_res], axis = 1)
concrete1_res.head()
# Studentized Residual
print('Studentized residuals as a first means for identifying outliers'); print('--'*60)
r = concrete1_res.resid_student
print('-'*30 + ' studentized residual ' + '-'*30)
display(r.describe())
print('\n')
r_sort = concrete1_res.sort_values(by = 'resid_student', ascending = True)
print('-'*30 + ' top 5 most negative residuals ' + '-'*30)
display(r_sort.head())
print('\n')
r_sort = concrete1_res.sort_values(by = 'resid_student', ascending = False)
print('-'*30 + ' top 5 most positive residuals ' + '-'*30)
display(r_sort.head())
We shall pay attention to studentized residuals that are:
- (high) more than +2 or less than -2
- (medium high) more than +2.5 or less than -2.5
- (very high) more than +3 or less than -3
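These three attention levels can be assigned in one step with `pd.cut`. A small sketch on hypothetical residual values (in the notebook the input would be `concrete1_res['resid_student']`):

```python
import numpy as np
import pandas as pd

# Hypothetical residual values for illustration
r = pd.Series([-3.4, -2.7, -2.1, -0.5, 0.3, 2.2, 2.6, 3.1])

# Bin |residual| into the three attention levels listed above;
# values with |r| <= 2 fall outside every bin and are dropped.
severity = pd.cut(r.abs(), bins=[2, 2.5, 3, np.inf],
                  labels=['high', 'medium high', 'very high'])
flagged = pd.DataFrame({'resid_student': r, 'severity': severity}).dropna()
print(flagged)
```

Because the bin edges are left-open, a residual of exactly ±2 is not flagged, matching the "more than" wording above.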
print('Studentized Residual more than +2 or less than -2'); print('--'*60)
res_index = concrete1_res[abs(r) > 2].index
print(res_index)
print('Let\'s look at leverage points to identify observations that will have a potentially great influence on the regression coefficient estimates.'); print('--'*60)
print('A point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors and n is the number of observations. In our example this works out to (2*8+2)/1030 = 0.017476')
leverage = concrete1_res.leverage
print('-'*30 + ' Leverage ' + '-'*30)
display(leverage.describe())
print('\n')
leverage_sort = concrete1_res.sort_values(by = 'leverage', ascending = False)
print('-'*30 + ' top 5 highest leverage data points ' + '-'*30)
display(leverage_sort.head())
print('Printing indexes where leverage exceeds 0.017476'); print('--'*60)
lev_index = concrete1_res[abs(leverage) > 0.017476].index
print(lev_index)
print('Let\'s take a look at DF-FITS. The conventional cut-off point for DF-FITS is 2*sqrt(k/n).')
print('DF-FITS can be either positive or negative, with numbers close to zero corresponding to the points with small or zero influence.'); print('--'*60)
import math
dffits_index = concrete1_res[abs(concrete1_res['dffits']) > 2 * math.sqrt(8 / 1030)].index
print(dffits_index)
set(res_index).intersection(lev_index).intersection(dffits_index)
print('Let\'s run the regression again without rows 452 and 469'); print('--'*60)
concrete1.drop([452, 469], axis = 0, inplace = True)
print(concrete1.shape)
lm1 = smf.ols(formula = 'strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age', data = concrete1).fit()
print(lm1.summary())
corr = concrete1.corr()
mask = np.array(corr)
mask[np.tril_indices_from(mask)] = False
fig,ax= plt.subplots()
fig.set_size_inches(12,7)
sns.heatmap(corr, mask=mask,vmax=.9, square=True,annot=True, cmap="BuPu")
# Absolute correlation of independent variables with the target variable
absCorrwithDep = []
allVars = concrete1.drop('strength', axis = 1).columns
for var in allVars:
    absCorrwithDep.append(abs(concrete1['strength'].corr(concrete1[var])))
display(pd.DataFrame([allVars, absCorrwithDep], index = ['Variable', 'Correlation']).T.\
    sort_values('Correlation', ascending = False))
None of the columns has a correlation above a problematic threshold (commonly |r| > 0.8), so none needs to be dropped.
age, cement and superplastic are some of the columns with a strong influence on the target variable.
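The "above a threshold" check can be automated by scanning the upper triangle of the correlation matrix. A minimal sketch on synthetic data (the 0.8 cutoff and column names are illustrative, not from the concrete data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical predictors: 'b' is a near-duplicate of 'a', 'c' is independent
df = pd.DataFrame({'a': rng.normal(size=100)})
df['b'] = df['a'] + rng.normal(scale=0.1, size=100)
df['c'] = rng.normal(size=100)

def high_corr_pairs(frame, threshold=0.8):
    """Return predictor pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(i, j) for i in upper.index for j in upper.columns
            if pd.notna(upper.loc[i, j]) and upper.loc[i, j] > threshold]

pairs = high_corr_pairs(df)
print(pairs)  # only the ('a', 'b') pair crosses the 0.8 cutoff
```

On the concrete predictors this would be called as `high_corr_pairs(concrete1.drop('strength', axis=1))` and should return an empty list, consistent with the conclusion above.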
Performing feature engineering on the concrete dataset. The objectives here are to:
Explore for Gaussians. If the data is likely to be a mix of Gaussians, explore individual clusters and present the findings in terms of the independent attributes and their suitability to predict strength.
Identify opportunities (if any) to create a composite feature, or drop a feature.
Decide on the complexity of the model.
Feature Engineering Gaussians Ref: https://www.kaggle.com/kenmatsu4/feature-engineering-with-gaussian-process
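One way to act on "decide on the complexity of the model" is to compare a linear fit against a quadratic one via scikit-learn's `PolynomialFeatures`. A hedged sketch on synthetic data (the data and degrees are illustrative; this is not run on the concrete frame here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=2.0, size=200)  # truly quadratic signal

# Fit degree-1 and degree-2 models and compare in-sample R2
scores = {}
for degree in (1, 2):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    scores[degree] = model.score(X, y)
print(scores)
```

If the degree-2 score is clearly higher (as it is for this synthetic quadratic signal), the extra parameters are earning their keep; on real data the comparison should of course use cross-validated rather than in-sample R2.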
### Feature Engineering
concrete1.reset_index(inplace = True, drop = True)
X = concrete1.drop('strength', axis = 1)
y = concrete1['strength']
labels = KMeans(2, random_state = 0).fit_predict(X)
# KMeans Plots
def kplots(df, ocol):
    # Plot ocol against every other column, coloured by the KMeans labels
    columns = [c for c in df.columns if c != ocol]
    f, ax = plt.subplots(4, 2, figsize = (15, 17))
    for i, col in enumerate(columns):
        axis = ax[i // 2][i % 2]
        axis.scatter(df[ocol], df[col], c = labels, s = 10, cmap = 'viridis')
        axis.set_xlabel(ocol); axis.set_ylabel(col)
col_list = list(concrete1.drop('strength',axis=1).columns)
for i in col_list:
    print(f'\n{i} vs Other Columns Clusters'); print('**'*60)
    kplots(X, i)
    plt.show()
Clusters can be observed between cement and the rest of the independent variables.
A cluster around age 100 can be seen.
Let's add features based on the cluster analysis we found for cement and the other columns.
# Adding features based on cement clusters
print('Let\'s add features based on cluster analysis we found for cement and other columns'); print('--'*60)
concrete1 = concrete1.join(pd.DataFrame(labels, columns = ['labels']), how = 'left')
cement_features = concrete1.groupby('labels')['cement'].agg(['mean', 'median']).reset_index()
concrete1 = concrete1.merge(cement_features, on = 'labels', how = 'left')
concrete1.rename(columns = {'mean': 'cement_labels_mean', 'median': 'cement_labels_median'}, inplace = True)
concrete1.drop('labels', axis = 1, inplace = True)
cdescribe(concrete1)
Check whether there exists any important feature interaction which we can make use of to create new features.
# Adding features
cement_age = concrete1.groupby('age')['cement'].agg(['mean', 'median']).reset_index()
concrete1 = concrete1.merge(cement_age, on = 'age', how = 'left')
concrete1.rename(columns = {'mean': 'cement_age_mean', 'median': 'cement_age_median'}, inplace = True)
water_age = concrete1.groupby('age')['water'].agg(['mean', 'median']).reset_index()
concrete1 = concrete1.merge(water_age, on = 'age', how = 'left')
concrete1.rename(columns = {'mean': 'water_age_mean', 'median': 'water_age_median'}, inplace = True)
concrete1.describe()
# Correlation matrix
corr = concrete1.corr()
mask = np.array(corr)
mask[np.tril_indices_from(mask)] = False
fig,ax= plt.subplots()
fig.set_size_inches(25,15)
sns.heatmap(corr, mask=mask,vmax=.9, square=True,annot=True, cmap="BuPu")
# Absolute correlation of independent variables with the target variable
absCorrwithDep = []
allVars = concrete1.drop('strength', axis = 1).columns
for var in allVars:
    absCorrwithDep.append(abs(concrete1['strength'].corr(concrete1[var])))
display(pd.DataFrame([allVars, absCorrwithDep], index = ['Variable', 'Correlation']).T.\
    sort_values('Correlation', ascending = False))
print('Checking if multicollinearity exists')
print('A Variance Inflation Factor between 5 and 10 indicates high correlation that may be problematic. \
And if the Variance Inflation Factor goes above 10, you can assume that the regression coefficients are poorly estimated \
due to multicollinearity.')
print('--'*60)
y, X = dmatrices('strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age + cement_labels_mean + cement_labels_median + cement_age_mean + cement_age_median + water_age_mean + water_age_median',
concrete1, return_type = 'dataframe')
vif = pd.DataFrame()
vif['VI Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['Features'] = X.columns
display(vif.round(1).sort_values(by = 'VI Factor', ascending = False))
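The single VIF pass above can be extended to iterative pruning, a common follow-up when several VIFs exceed 10: drop the worst offender, recompute, repeat. A sketch on synthetic data (the function name and threshold are illustrative, not from this notebook):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X, threshold=10.0):
    """Iteratively drop the feature with the highest VIF until every
    remaining VIF falls below the threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() < threshold:
            return X, vifs
        X = X.drop(columns=vifs.idxmax())

# Hypothetical demo: x2 is nearly a copy of x1, so one of the pair is pruned
rng = np.random.default_rng(1)
demo = pd.DataFrame({'x1': rng.normal(size=200), 'x3': rng.normal(size=200)})
demo['x2'] = demo['x1'] + rng.normal(scale=0.01, size=200)
pruned, final_vifs = prune_by_vif(demo[['x1', 'x3', 'x2']])
print(list(pruned.columns))
```

Dropping one feature at a time matters: removing either member of a collinear pair usually makes the other's VIF collapse, so deleting all high-VIF columns in one sweep would discard more information than necessary.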
age, cement, water and slag are some of the important features based on eli5 and model-based feature importance. Dropping all the newly added features since they introduced multicollinearity.
concrete1.drop(['water_age_mean', 'water_age_median', 'cement_age_mean', 'cement_age_median', 'cement_labels_mean', 'cement_labels_median'], axis = 1, inplace = True)
concrete1.shape, concrete1.columns
print('Split into training (70%), validation (10%) and test (20%) sets, both with EDA & FE and without EDA & FE.')
print('--'*60)
# Training, validation and test sets with outliers
X = concrete.drop('strength', axis = 1); y = concrete['strength']; features_list = list(X.columns)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.12, random_state = random_state)
print(f'Shape of train, valid and test datasets without EDA, FE: {(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)}')
print(f'Proportion in the splits for train, valid, test datasets without EDA, FE: {round(len(X_train)/len(X), 2), round(len(X_val)/len(X), 2), round(len(X_test)/len(X), 2)}')
# Training, validation and test sets without outliers
X = concrete1.drop('strength', axis = 1); y = concrete1['strength']; features_list = list(X.columns)
X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(X, y, test_size = 0.2, random_state = random_state)
X_train_fe, X_val_fe, y_train_fe, y_val_fe = train_test_split(X_train_fe, y_train_fe, test_size = 0.12, random_state = random_state)
print(f'\nShape of train, valid and test datasets with EDA, FE: {(X_train_fe.shape, y_train_fe.shape, X_val_fe.shape, y_val_fe.shape, X_test_fe.shape, y_test_fe.shape)}')
print(f'Proportion in the splits for train, valid, test datasets with EDA, FE: {round(len(X_train_fe)/len(X), 2), round(len(X_val_fe)/len(X), 2), round(len(X_test_fe)/len(X), 2)}')
training_test_sets = {'withoutedafe': (X_train, y_train, X_val, y_val), 'withedafe': (X_train_fe, y_train_fe, X_val_fe, y_val_fe)}
print('Let\'s check cross validated scores on linear models and tree-based models on training and validation sets with and without EDA & FE')
print('--'*60)
models = []
models.append(('Linear', LinearRegression()))
models.append(('Lasso', Lasso(random_state = random_state)))
models.append(('Ridge', Ridge(random_state = random_state)))
models.append(('SVR', SVR()))
models.append(('DecisionTree', DecisionTreeRegressor(random_state = random_state)))
models.append(('GradientBoost', GradientBoostingRegressor(random_state = random_state)))
models.append(('AdaBoost', AdaBoostRegressor(random_state = random_state)))
models.append(('ExtraTrees', ExtraTreesRegressor(random_state = random_state)))
models.append(('RandomForest', RandomForestRegressor(random_state = random_state)))
models.append(('Bagging', BaggingRegressor(DecisionTreeRegressor(random_state = random_state), random_state = random_state)))
models.append(('CatBoost', CatBoostRegressor(random_state = random_state, silent = True)))
scoring = 'r2'; results = {}; score = {}
for encoding_label, (_X_train, _y_train, _X_val, _y_val) in training_test_sets.items():
    scores = []; result_cv = []; names = []
    for name, model in models:
        kf = KFold(n_splits = 10, shuffle = True, random_state = random_state)
        cv_results = cross_val_score(model, _X_train, _y_train, cv = kf, scoring = scoring)
        result_cv.append(cv_results); names.append(name)
        scores.append([name, cv_results.mean().round(4), cv_results.std().round(4)])
    score[encoding_label] = scores
    results[encoding_label] = [names, result_cv]
print('Let\'s check the cv scores (r2) for sets without EDA and FE')
display(score['withoutedafe'])
print('\nLet\'s check the cv scores (r2) for sets with EDA and FE')
display(score['withedafe'])
pd.options.display.float_format = "{:.4f}".format
scores_df = pd.concat([pd.DataFrame(score['withoutedafe'], columns = ['Model', 'R2 (Mean) Without', 'R2 (Std) Without']).set_index('Model'),
pd.DataFrame(score['withedafe'], columns = ['Model', 'R2 (Mean) With', 'R2 (Std) With']).set_index('Model')], axis = 1)
scores_df['Improvement?'] = scores_df['R2 (Mean) With'] - scores_df['R2 (Mean) Without']
display(scores_df)
print('A significant improvement in r2 scores after EDA & FE for linear algorithms, whereas scores remain almost the same for tree-based algorithms.'); print('--'*60)
fig,(ax1, ax2) = plt.subplots(1, 2, figsize = (20, 7.2))
ax1.boxplot(results['withoutedafe'][1]); ax1.set_xticklabels(results['withoutedafe'][0], rotation = 90); ax1.set_title('CV Score - without EDA and FE')
ax2.boxplot(results['withedafe'][1]); ax2.set_xticklabels(results['withedafe'][0], rotation = 90); ax2.set_title('CV Score - with EDA and FE')
plt.show()
We see an improvement in the scores compared with the uncleaned data. Improvements are clearly seen for linear algorithms, whereas for tree-based ones the score marginally increases or decreases. Tree-based algorithms are the clear choice when it comes to a linear vs tree-based comparison.
# For rmse scoring
def rmse_score(y, y_pred):
    return np.sqrt(np.mean((y_pred - y)**2))
scalers = {'notscaled': None, 'standardscaling': StandardScaler(), 'robustscaling': RobustScaler()}
training_test_sets = {'validation_sets': (X_train_fe, y_train_fe, X_val_fe, y_val_fe),
'test_sets': (X_train_fe, y_train_fe, X_test_fe, y_test_fe)}
# initialize model
cat_reg = CatBoostRegressor(iterations = None, eval_metric = 'RMSE', random_state = random_state, od_type = 'Iter', od_wait = 5)
# iterate over all possible combinations and get the errors
errors = {}
for encoding_label, (_X_train, _y_train, _X_val, _y_val) in training_test_sets.items():
    for scaler_label, scaler in scalers.items():
        trainingset = _X_train.copy()
        testset = _X_val.copy()
        if scaler is not None:
            trainingset = scaler.fit_transform(trainingset)
            testset = scaler.transform(testset)
        cat_reg.fit(trainingset, _y_train, early_stopping_rounds = 5, verbose = False, plot = False,
                    eval_set = [(testset, _y_val)], use_best_model = True)
        pred = cat_reg.predict(testset)
        errors[encoding_label + ' - ' + scaler_label] = [rmse_score(_y_val, pred), r2_score(_y_val, pred)]
# Function to get top results from grid search and randomized search
def search_report(results):
    df = pd.concat([pd.DataFrame(results.cv_results_['params']), pd.DataFrame(results.cv_results_['mean_test_score'], columns = ['r2'])], axis = 1)
    return df
print('It can be seen that RMSE is lowest when robust scaling is used, whereas R2 remains almost the same as with un-scaled data.');
print('Scaling would help to effectively use the training and validation sets across algorithms.');print('--'*60)
display(errors)
It can be seen that RMSE and R2 scores are almost the same whether the data is not scaled, standard scaled or robust scaled.
## Helper function to train, validate and predict
def train_val_predict(basemodel, train_X, train_y, test_X, test_y, name, model):
    folds = list(KFold(n_splits = 5, random_state = random_state, shuffle = True).split(train_X, train_y))
    r2_scores_train = []; r2_scores_val = []; r2_scores_test = []
    for j, (train_index, val_index) in enumerate(folds):
        X_train = train_X.iloc[train_index]
        y_train = train_y.iloc[train_index]
        X_val = train_X.iloc[val_index]
        y_val = train_y.iloc[val_index]
        if model == 'CatBoost':
            basemodel.fit(X_train, y_train, early_stopping_rounds = 5, verbose = 300, eval_set = [(X_val, y_val)], use_best_model = True)
        else:
            basemodel.fit(X_train, y_train)
        pred = basemodel.predict(X_train)
        r2_scores_train.append(r2_score(y_train, pred))
        pred = basemodel.predict(X_val)
        r2_scores_val.append(r2_score(y_val, pred))
        pred = basemodel.predict(test_X)
        r2_scores_test.append(r2_score(test_y, pred))
    df = pd.DataFrame([np.mean(r2_scores_train), np.mean(r2_scores_val), np.mean(r2_scores_test)],
                      index = ['r2 Scores Train', 'r2 Scores Val', 'r2 Scores Test'],
                      columns = [name]).T
    return df
print('Separating the dependents and independents + Scaling the data'); print('--'*60)
features_list = list(concrete1.columns)
concrete1 = concrete1.apply(zscore)  # apply(zscore) already returns a DataFrame with the original columns
display(concrete1.describe())
X = concrete1.drop('strength', axis = 1); y = concrete1['strength'];
X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(X, y, test_size = 0.2, random_state = random_state)
X_train_fe.shape, X_test_fe.shape, y_train_fe.shape, y_test_fe.shape
print('Using the 5-Fold Linear Regression to train, validate and predict'); print('--'*60)
lr_reg = LinearRegression()
df_lr = train_val_predict(lr_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold LinearRegression', model = 'LR')
%%time
print('Using the 5-Fold Lasso Regression to train, validate and predict'); print('--'*60)
lasso_reg = Lasso(alpha = 0.01)
df_lasso = train_val_predict(lasso_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold LassoRegression', model = 'Lasso')
df = pd.concat([df_lr, df_lasso])
%%time
print('Using the 5-Fold Ridge Regression to train, validate and predict'); print('--'*60)
ridge_reg = Ridge(alpha = 0.01)
df_ridge = train_val_predict(ridge_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold RidgeRegression', model = 'Ridge')
df = pd.concat([df, df_ridge])
display(df)
%%time
print('Finding out the hyperparameters for Decision Tree and Random Forest with GridSearchCV'); print('--'*60)
best_params_grid = {}
# Decision Tree and Random Forest Regressor Hyperparameters Grid
param_grid = {'DecisionTree': {'criterion': ['squared_error', 'absolute_error'],  # 'mse'/'mae' in scikit-learn < 1.0
                               'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None]},
              'RandomForest': {'bootstrap': [True, False], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None],
                               'max_features': [1.0, 'sqrt'],  # 1.0 replaces the removed 'auto'
                               'n_estimators': [200, 400, 600, 800]}}
# Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state = random_state)
dt_reg_grid = GridSearchCV(dt_reg, param_grid['DecisionTree'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
dt_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['DecisionTree'] = dt_reg_grid.best_params_
# Random Forest Regressor
rf_reg = RandomForestRegressor(random_state = random_state)
rf_reg_grid = GridSearchCV(rf_reg, param_grid['RandomForest'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
rf_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['RandomForest'] = rf_reg_grid.best_params_
print(f'Best parameters for Decision Tree and Random Forest using GridSearchCV: {best_params_grid}')
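Beyond `best_params_`, `GridSearchCV` exposes the full score table in `cv_results_`, which helps judge how close the runner-up settings are. A sketch on synthetic data:

```python
# Sketch: inspecting the full GridSearchCV score table (synthetic data).
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {'max_depth': [2, 4, 8, None]}, cv=5, scoring='r2')
grid.fit(X_demo, y_demo)
table = pd.DataFrame(grid.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(table.sort_values('rank_test_score'))
print(grid.best_params_)
```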
%%time
print('Finding out the hyperparameters for Decision Tree and Random Forest with RandomizedSearchCV'); print('--'*60)
best_params_random = {}
# Decision Tree and Random Forest Regressor Hyperparameters Grid
param_grid = {'DecisionTree': {'criterion': ['squared_error', 'absolute_error'],  # 'mse'/'mae' in scikit-learn < 1.0
                               'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None]},
              'RandomForest': {'bootstrap': [True, False], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None],
                               'max_features': [1.0, 'sqrt'],  # 1.0 replaces the removed 'auto'
                               'n_estimators': [200, 400, 600, 800]}}
# Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state = random_state)
dt_reg_grid = RandomizedSearchCV(dt_reg, param_grid['DecisionTree'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
dt_reg_grid.fit(X_train_fe, y_train_fe)
best_params_random['DecisionTree'] = dt_reg_grid.best_params_
# Random Forest Regressor
rf_reg = RandomForestRegressor(random_state = random_state)
rf_reg_grid = RandomizedSearchCV(rf_reg, param_grid['RandomForest'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
rf_reg_grid.fit(X_train_fe, y_train_fe)
best_params_random['RandomForest'] = rf_reg_grid.best_params_
print(f'Best parameters for Decision Tree and Random Forest using RandomizedSearchCV: {best_params_random}')
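With only a handful of discrete candidates, `RandomizedSearchCV`'s default `n_iter = 10` samples a large share of the grid, so it behaves much like grid search here. Its advantage shows when parameters are drawn from distributions; a sketch on synthetic data, with scipy distributions chosen for illustration:

```python
# Sketch: randomized search over distributions instead of fixed lists (synthetic data).
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
dist = {'max_depth': randint(2, 11), 'n_estimators': randint(50, 201)}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0), dist,
                            n_iter=5, cv=3, scoring='r2', random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_)
```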
%%time
print('Using the 5-Fold Decision Tree Regressor to train, validate and predict'); print('--'*60)
dt_reg = DecisionTreeRegressor(random_state = random_state)
df_reg = train_val_predict(dt_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold DecisionTree', model = 'DT')
df = pd.concat([df, df_reg])
%%time
print('Using the 5-Fold Decision Tree Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
dt_reg_grid = DecisionTreeRegressor(random_state = random_state, **best_params_grid['DecisionTree'])
df_reg_grid = train_val_predict(dt_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold DecisionTree GridSearchCV', model = 'DT')
df = pd.concat([df, df_reg_grid])
%%time
print('Using the 5-Fold Decision Tree Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
dt_reg_rand = DecisionTreeRegressor(random_state = random_state, **best_params_random['DecisionTree'])
df_reg_rand = train_val_predict(dt_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold DecisionTree RandomizedSearchCV', model = 'DT')
df = pd.concat([df, df_reg_rand])
display(df)
%%time
print('Using the 5-Fold Random Forest Regressor to train, validate and predict'); print('--'*60)
rf_reg = RandomForestRegressor(random_state = random_state)
df_reg = train_val_predict(rf_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold RandomForest', model = 'RF')
df = pd.concat([df, df_reg])
%%time
print('Using the 5-Fold Random Forest Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
rf_reg_grid = RandomForestRegressor(random_state = random_state, **best_params_grid['RandomForest'])
df_reg_grid = train_val_predict(rf_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold RandomForest GridSearchCV', model = 'RF')
df = pd.concat([df, df_reg_grid])
%%time
print('Using the 5-Fold Random Forest Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
rf_reg_rand = RandomForestRegressor(random_state = random_state, **best_params_random['RandomForest'])
df_reg_rand = train_val_predict(rf_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold RandomForest RandomizedSearchCV', model = 'RF')
df = pd.concat([df, df_reg_rand])
display(df)
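Tree ensembles also report which predictors drive the fit; `feature_importances_` (normalized to sum to 1) are a quick check that the model leans on sensible inputs. A sketch with synthetic stand-in features:

```python
# Sketch: reading feature importances from a fitted random forest (synthetic data).
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_demo, y_demo)
imp = pd.Series(rf_demo.feature_importances_,
                index=[f'f{i}' for i in range(8)]).sort_values(ascending=False)
print(imp)  # importances are normalized to sum to 1
```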
%%time
print('Using the 5-Fold Bagging Regressor to train, validate and predict'); print('--'*60)
bag_reg = BaggingRegressor(random_state = random_state)
df_reg = train_val_predict(bag_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold Bagging', model = 'Bag')
df = pd.concat([df, df_reg])  # keep the fitted estimator in bag_reg instead of overwriting it
%%time
# Bagging Regressor Hyperparameters Grid
print('Finding out the hyperparameters for Bagging Regressor with GridSearchCV'); print('--'*60)
param_grid = {'Bagging': {'estimator': [DecisionTreeRegressor(random_state = random_state, **best_params_grid['DecisionTree']), None],  # 'base_estimator' before scikit-learn 1.2
                          'n_estimators': [100, 150, 200]}}
# Bagging Regressor
bag_reg = BaggingRegressor(random_state = random_state)
bag_reg_grid = GridSearchCV(bag_reg, param_grid['Bagging'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
bag_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['Bagging'] = bag_reg_grid.best_params_
print('Best parameters for Bagging Regressor using GridSearchCV: {}'.format(best_params_grid['Bagging']))
%%time
# Bagging Regressor Hyperparameters Grid with RandomizedSearchCV
print('Finding out the hyperparameters for Bagging with RandomizedSearchCV'); print('--'*60)
param_grid = {'Bagging': {'estimator': [DecisionTreeRegressor(random_state = random_state, **best_params_grid['DecisionTree']), None],  # 'base_estimator' before scikit-learn 1.2
                          'n_estimators': [100, 150, 200]}}
# Bagging Regressor
bag_reg = BaggingRegressor(random_state = random_state)
bag_reg_rand = RandomizedSearchCV(bag_reg, param_grid['Bagging'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
bag_reg_rand.fit(X_train_fe, y_train_fe)
best_params_random['Bagging'] = bag_reg_rand.best_params_
print('Best parameters for Bagging Regressor using RandomizedSearchCV: {}'.format(best_params_random['Bagging']))
%%time
print('Using the 5-Fold Bagging Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
bag_reg_grid = BaggingRegressor(random_state = random_state, **best_params_grid['Bagging'])
df_reg_grid = train_val_predict(bag_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold Bagging using GridSearchCV', model = 'Bag')
df = pd.concat([df, df_reg_grid])
%%time
print('Using the 5-Fold Bagging Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
bag_reg_rand = BaggingRegressor(random_state = random_state, **best_params_random['Bagging'])
df_reg_rand = train_val_predict(bag_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold Bagging using RandomizedSearchCV', model = 'Bag')
df = pd.concat([df, df_reg_rand])
display(df)
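Because bagging draws bootstrap samples, each base estimator leaves out roughly a third of the rows; setting `oob_score=True` scores those out-of-bag rows for a free validation estimate. A sketch on synthetic data:

```python
# Sketch: out-of-bag R2 from a bagging ensemble (synthetic data).
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
bag_demo = BaggingRegressor(n_estimators=100, oob_score=True, random_state=0)
bag_demo.fit(X_demo, y_demo)
print(round(bag_demo.oob_score_, 3))  # R2 on rows each estimator never saw
```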
%%time
# AdaBoost Regressor Hyperparameters with GridSearchCV
print('Finding out the hyperparameters for AdaBoostRegressor with GridSearchCV'); print('--'*60)
param_grid = {'AdaBoost': {'estimator': [DecisionTreeRegressor(random_state = random_state, **best_params_grid['DecisionTree']), None],  # 'base_estimator' before scikit-learn 1.2
                           'n_estimators': [100, 150, 200], 'learning_rate': [0.01, 0.1, 1.0]}}
# AdaBoost Regressor
ada_reg = AdaBoostRegressor(random_state = random_state)
ada_reg_grid = GridSearchCV(ada_reg, param_grid['AdaBoost'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
ada_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['AdaBoost'] = ada_reg_grid.best_params_
print('Best parameters for AdaBoost Regressor using GridSearchCV: {}'.format(best_params_grid['AdaBoost']))
%%time
# AdaBoost Regressor Hyperparameters Grid with RandomizedSearchCV
print('Finding out the hyperparameters for AdaBoostRegressor with RandomizedSearchCV'); print('--'*60)
param_grid = {'AdaBoost': {'estimator': [DecisionTreeRegressor(random_state = random_state, **best_params_grid['DecisionTree']), None],  # 'base_estimator' before scikit-learn 1.2
                           'n_estimators': [100, 150, 200], 'learning_rate': [0.01, 0.1, 1.0]}}
# AdaBoost Regressor
ada_reg = AdaBoostRegressor(random_state = random_state)
ada_reg_rand = RandomizedSearchCV(ada_reg, param_grid['AdaBoost'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
ada_reg_rand.fit(X_train_fe, y_train_fe)
best_params_random['AdaBoost'] = ada_reg_rand.best_params_
print('Best parameters for AdaBoost Regressor using RandomizedSearchCV: {}'.format(best_params_random['AdaBoost']))
%%time
print('Using the 5-Fold Ada Boost Regressor to train, validate and predict'); print('--'*60)
ada_reg = AdaBoostRegressor(random_state = random_state)
df_reg = train_val_predict(ada_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold AdaBoost', model = 'Ada')
df = pd.concat([df, df_reg])
%%time
print('Using the 5-Fold Ada Boost Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
ada_reg_grid = AdaBoostRegressor(random_state = random_state, **best_params_grid['AdaBoost'])
df_reg_grid = train_val_predict(ada_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold AdaBoost using GridSearchCV', model = 'Ada')
df = pd.concat([df, df_reg_grid])
%%time
print('Using the 5-Fold Ada Boost Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
ada_reg_rand = AdaBoostRegressor(random_state = random_state, **best_params_random['AdaBoost'])
df_reg_rand = train_val_predict(ada_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold AdaBoost using RandomizedSearchCV', model = 'Ada')
df = pd.concat([df, df_reg_rand])
display(df)
%%time
# GradientBoostingRegressor Hyperparameters Grid with GridSearchCV
print('Finding out the hyperparameters for GradientBoostRegressor with GridSearchCV'); print('--'*60)
param_grid = {'GradientBoost': {'max_depth': [5, 6, 7, 8, 9, 10, None], 'max_features': [None, 'sqrt'],  # None replaces the removed 'auto'
                                'n_estimators': [600, 800, 1000]}}
# GradientBoostRegressor
gb_reg = GradientBoostingRegressor(random_state = random_state)
gb_reg_grid = GridSearchCV(gb_reg, param_grid['GradientBoost'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
gb_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['GradientBoost'] = gb_reg_grid.best_params_
print('Best parameters for Gradient Boost Regressor using GridSearchCV: {}'.format(best_params_grid['GradientBoost']))
%%time
# GradientBoostingRegressor Hyperparameters Grid with RandomizedSearchCV
print('Finding out the hyperparameters for GradientBoostRegressor with RandomizedSearchCV'); print('--'*60)
param_grid = {'GradientBoost': {'max_depth': [5, 6, 7, 8, 9, 10, None], 'max_features': [None, 'sqrt'],  # None replaces the removed 'auto'
                                'n_estimators': [600, 800, 1000]}}
# GradientBoostRegressor
gb_reg = GradientBoostingRegressor(random_state = random_state)
gb_reg_rand = RandomizedSearchCV(gb_reg, param_grid['GradientBoost'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
gb_reg_rand.fit(X_train_fe, y_train_fe)
best_params_random['GradientBoost'] = gb_reg_rand.best_params_
print('Best parameters for Gradient Boost Regressor using RandomizedSearchCV: {}'.format(best_params_random['GradientBoost']))
%%time
print('Using the 5-Fold Gradient Boost Regressor to train, validate and predict'); print('--'*60)
gb_reg = GradientBoostingRegressor(random_state = random_state)
df_reg = train_val_predict(gb_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold GradientBoost', model = 'GB')
df = pd.concat([df, df_reg])
%%time
print('Using the 5-Fold Gradient Boost Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
gb_reg_grid = GradientBoostingRegressor(random_state = random_state, **best_params_grid['GradientBoost'])
df_reg_grid = train_val_predict(gb_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold GradientBoost using GridSearchCV', model = 'GB')
df = pd.concat([df, df_reg_grid])
%%time
print('Using the 5-Fold Gradient Boost Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
gb_reg_rand = GradientBoostingRegressor(random_state = random_state, **best_params_random['GradientBoost'])
df_reg_rand = train_val_predict(gb_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold GradientBoost using RandomizedSearchCV', model = 'GB')
df = pd.concat([df, df_reg_rand])
display(df)
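Instead of grid-searching `n_estimators`, `GradientBoostingRegressor` can stop adding stages once a held-out fraction of the training data stops improving (`n_iter_no_change`). A sketch on synthetic data:

```python
# Sketch: built-in early stopping for gradient boosting (synthetic data).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)
gb_demo = GradientBoostingRegressor(n_estimators=1000, validation_fraction=0.1,
                                    n_iter_no_change=10, random_state=0)
gb_demo.fit(X_demo, y_demo)
print(gb_demo.n_estimators_)  # boosting stages actually fitted (capped at 1000)
```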
%%time
# ExtraTreesRegressor Hyperparameters Grid with GridSearchCV
print('Finding out the hyperparameters for ExtraTreesRegressor with GridSearchCV'); print('--'*60)
param_grid = {'ExtraTrees': {'max_depth': [5, 6, 7, 8, 9, 10, None], 'max_features': [1.0, 'sqrt'],  # 1.0 replaces the removed 'auto'
                             'n_estimators': [100, 600, 800, 1000]}}
# ExtraTreesRegressor
et_reg = ExtraTreesRegressor(random_state = random_state)
et_reg_grid = GridSearchCV(et_reg, param_grid['ExtraTrees'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
et_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['ExtraTrees'] = et_reg_grid.best_params_
print('Best parameters for Extra Trees Regressor using GridSearchCV: {}'.format(best_params_grid['ExtraTrees']))
%%time
# ExtraTreesRegressor Hyperparameters Grid with RandomizedSearchCV
print('Finding out the hyperparameters for ExtraTreesRegressor with RandomizedSearchCV'); print('--'*60)
param_grid = {'ExtraTrees': {'max_depth': [5, 6, 7, 8, 9, 10, None], 'max_features': [1.0, 'sqrt'],  # 1.0 replaces the removed 'auto'
                             'n_estimators': [100, 600, 800, 1000]}}
# ExtraTreesRegressor
et_reg = ExtraTreesRegressor(random_state = random_state)
et_reg_rand = RandomizedSearchCV(et_reg, param_grid['ExtraTrees'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
et_reg_rand.fit(X_train_fe, y_train_fe)
best_params_random['ExtraTrees'] = et_reg_rand.best_params_
print('Best parameters for Extra Trees Regressor using RandomizedSearchCV: {}'.format(best_params_random['ExtraTrees']))
%%time
print('Using the 5-Fold Extra Trees Regressor to train, validate and predict'); print('--'*60)
et_reg = ExtraTreesRegressor(random_state = random_state)
df_reg = train_val_predict(et_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold ExtraTrees', model = 'ET')
df = pd.concat([df, df_reg])
%%time
print('Using the 5-Fold Extra Trees Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
et_reg_grid = ExtraTreesRegressor(random_state = random_state, **best_params_grid['ExtraTrees'])
df_reg_grid = train_val_predict(et_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold ExtraTrees using GridSearchCV', model = 'ET')
df = pd.concat([df, df_reg_grid])
%%time
print('Using the 5-Fold Extra Trees Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
et_reg_rand = ExtraTreesRegressor(random_state = random_state, **best_params_random['ExtraTrees'])
df_reg_rand = train_val_predict(et_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold ExtraTrees using RandomizedSearchCV', model = 'ET')
df = pd.concat([df, df_reg_rand])
display(df)
%%time
print('Finding out the hyperparameters for CatBoost with GridSearch'); print('--'*60)
param_grid = {'CatBoost': {'learning_rate': np.arange(0.01, 0.31, 0.05), 'depth': [3, 4, 5, 6, 7, 8, 9, 10], 'l2_leaf_reg': np.arange(2, 10, 1)}}
# Cat Boost Regressor
cat_reg = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5)
best_params = cat_reg.grid_search(param_grid['CatBoost'], X = X_train_fe, y = y_train_fe, cv = 3, verbose = 150)
best_params_grid['CatBoostGridSearch'] = best_params['params']
%%time
print('Finding out the hyperparameters for CatBoost with RandomSearch'); print('--'*60)
param_grid = {'CatBoost': {'learning_rate': np.arange(0.01, 0.31, 0.05), 'depth': [3, 4, 5, 6, 7, 8, 9, 10], 'l2_leaf_reg': np.arange(2, 10, 1)}}
# Cat Boost Regressor
cat_reg = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5)
best_params = cat_reg.randomized_search(param_grid['CatBoost'], X = X_train_fe, y = y_train_fe, cv = 3, verbose = 150)
best_params_grid['CatBoostRandomSearch'] = best_params['params']
%%time
print('Using the 5-Fold CatBoost Regressor to train, validate and predict'); print('--'*60)
cb_reg = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5)
df_reg = train_val_predict(cb_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold CatBoost', model = 'CatBoost')
df = pd.concat([df, df_reg])
%%time
print('Using the 5-Fold CatBoost Regressor to train, validate and predict using GridSearch'); print('--'*60)
cb_reg_grid = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5, **best_params_grid['CatBoostGridSearch'])
df_reg_grid = train_val_predict(cb_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold CatBoost GridSearchCV', model = 'CatBoost')
df = pd.concat([df, df_reg_grid])
%%time
print('Using the 5-Fold CatBoost Regressor to train, validate and predict using RandomSearch'); print('--'*60)
cb_reg_rand = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5, **best_params_grid['CatBoostRandomSearch'], verbose = False)
df_reg_rand = train_val_predict(cb_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold CatBoost RandomSearchCV', model = 'CatBoost')
df = pd.concat([df, df_reg_rand])
display(df)
%%time
values = concrete1.values
n_iterations = 600  # number of bootstrap samples to draw
n_size = len(concrete1)  # each bootstrap sample matches the full dataset size
# run bootstrap
stats = list() # empty list that will hold the scores for each bootstrap iteration
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples = n_size)  # sampling with replacement
    test = np.array([x for x in values if x.tolist() not in train.tolist()])  # out-of-bag rows not drawn into the sample
    # fit model
    gb_reg_grid = GradientBoostingRegressor(random_state = random_state, **best_params_grid['GradientBoost'])
    gb_reg_grid.fit(train[:, :-1], train[:, -1])  # fit on the bootstrap predictors and target
    # evaluate model on the out-of-bag rows
    predictions = gb_reg_grid.predict(test[:, :-1])
    score = r2_score(test[:, -1], predictions)
    stats.append(score)
# plot scores
plt.figure(figsize = (15, 7.2))
plt.hist(stats); plt.show()
# confidence intervals
alpha = 0.95  # 95% confidence level: 2.5% in each tail
p = ((1.0 - alpha) / 2.0) * 100  # lower-tail percentile (2.5)
lower = max(0.0, np.percentile(stats, p))
p = (alpha + ((1.0 - alpha) / 2.0)) * 100  # upper-tail percentile (97.5)
upper = min(1.0, np.percentile(stats, p))
print('%.0f%% confidence interval: %.1f%% to %.1f%%' % (alpha * 100, lower * 100, upper * 100))
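Both tail percentiles can also come from a single `np.percentile` call, which keeps the 2.5/97.5 arithmetic explicit. The sketch below uses random scores as a stand-in for the bootstrap `stats` list:

```python
# Sketch: both CI bounds in one percentile call (random scores stand in for `stats`).
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.9, scale=0.02, size=600)  # placeholder bootstrap R2 scores
lower, upper = np.percentile(scores, [2.5, 97.5])
print(f'95% CI: [{lower:.3f}, {upper:.3f}]')
```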
display(df)